Dependency Treebank of Urdu and its Evaluation

نویسندگان

  • Riyaz Ahmad Bhat
  • Dipti Misra Sharma
چکیده

In this paper we describe a currently underway treebanking effort for Urdu-a South Asian language. The treebank is built from a newspaper corpus and uses a Karaka based grammatical framework inspired by Paninian grammatical theory. Thus far 3366 sentences (0.1M words) have been annotated with the linguistic information at morpho-syntactic (morphological, part-of-speech and chunk information) and syntactico-semantic (dependency) levels. This work also aims to evaluate the correctness or reliability of this manual annotated dependency treebank. Evaluation is done by measuring the inter-annotator agreement on a manually annotated data set of 196 sentences (5600 words) annotated by two annotators. We present the qualitative analysis of the agreement statistics and identify the possible reasons for the disagreement between the annotators. We also show the syntactic annotation of some constructions specific to Urdu like Ezafe and discuss the problem of word segmentation (tokenization).

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

A Data-Driven Dependency Parser for Urdu

One of the main motivations for building treebanks is that they facilitate the development of syntactic parsers, by providing realistic data for evaluation as well as inductive learning. In this paper we present what we believe to be the first data-driven dependency parser for Urdu, which has developed using MaltParser system and trained and evaluated on data from Urdu dependency treebank. A 40...

متن کامل

Exploiting Language Variants Via Grammar Parsing Having Morphologically Rich Information

In this paper, the development and evaluation of the Urdu parser is presented along with the comparison of existing resources for the language variants Urdu/Hindi. This parser was given a linguistically rich grammar extracted from a treebank. This context free grammar with sufficient encoded information is comparable with the state of the art parsing requirements for morphologically rich and cl...

متن کامل

Urdu Dependency Parser: A Data-Driven approach

In this paper, we present what we believe to be the first data-driven dependency parser for Urdu. The parser was trained and tuned using MaltParser system, a system for data-driven dependency parsing. The Urdu dependency treebank (UDT) is used for training and testing of the Urdu dependency parser, is also presented first time. The UDT contains corpus of 2853 sentences which are annotated at mu...

متن کامل

A Multi-Representational and Multi-Layered Treebank for Hindi/Urdu

This paper describes the simultaneous development of dependency structure and phrase structure treebanks for Hindi and Urdu, as well as a PropBank. The dependency structure and the PropBank are manually annotated, and then the phrase structure treebank is produced automatically. To ensure successful conversion the development of the guidelines for all three representations are carefully coordin...

متن کامل

Building Computational Resources: The URDU.KON-TB Treebank and the Urdu Parser

This work presents the development of the URDU.KON-TB treebank, its annotation evaluation & guidelines and the construction of the Urdu parser for a South Asian language Urdu. Urdu is comparatively an under-resourced language and the development of a reliable treebank and a parser will have significant impact on the state-of-the-art for automatic Urdu language processing. The work includes the ...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2012